v3.2: Guidance on searching and evaluating schemas #4743


Status: Open. Wants to merge 5 commits into base branch v3.2-dev.

Conversation

handrews (Member):

NOTE 1: This is intended to clarify requirements that already exist but have never been well-defined, both by making certain things required and stating clearly that other things are not. It is particularly relevant in light of the Encoding Object changes, although the vaguely-defined behavior predates the new features.

NOTE 2: I wasn't sure whether to put this here or under the Schema Object, or some other arrangement; suggestions welcome!

Some OAS features casually state that they depend on the type of data being examined, or leave it ambiguous how implementations should determine the correct way to parse the data.

This section attempts to provide some guidance and limits, requiring only that implementations follow the unambiguous, statically deterministic keywords $ref and allOf.

It also provides for just validating the data (when possible) and using the actual in-memory type when a schema is too complex to analyze statically.

One use of this is breaking apart schemas to use them with mixed binary and JSON-compatible data, and a new section has been added to address that.

Finally, a typo in a related section was fixed.

  • schema changes are included in this pull request
  • schema changes are needed for this pull request but not done yet
  • no schema changes are needed for this pull request

@handrews handrews added this to the v3.2.0 milestone Jun 21, 2025
@handrews handrews requested a review from a team as a code owner June 21, 2025 01:12
@handrews handrews added the media and encoding Issues regarding media type support and how to encode data (outside of query/path params) label Jun 21, 2025
src/oas.md Outdated

When the data is in a non-JSON format, particularly one such as XML or various form media types where data is stored as strings without type information, it can be necessary to find this information through the relevant Schema Object to determine how to parse the format into a structure that can be validated by the schema.
As schema organization can become very complex, implementations are not expected to handle every possible schema layout.
However, given a known starting point schema (usually the value of the nearest `schema` field), implementations MUST search the following for the relevant keywords (e.g. `type`, `format`, `contentMediaType`, etc.):
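
The required search above (following only `$ref` and `allOf` from a known starting schema) could be sketched roughly as follows. This is an illustrative sketch, not spec text: `find_keyword` and `resolve_ref` are invented names, `resolve_ref` stands in for whatever reference-resolution machinery an implementation already has, and the depth cap is an implementation choice.

```python
def find_keyword(schema, keyword, resolve_ref, depth=0, max_depth=32):
    """Search a schema for `keyword`, following only `$ref` and `allOf`.

    `resolve_ref` is a hypothetical callback mapping a `$ref` URI to the
    referenced schema object.  Returns the first value found, or None.
    """
    if depth > max_depth or not isinstance(schema, dict):
        return None
    if keyword in schema:
        return schema[keyword]
    if "$ref" in schema:
        found = find_keyword(resolve_ref(schema["$ref"]), keyword,
                             resolve_ref, depth + 1, max_depth)
        if found is not None:
            return found
    for subschema in schema.get("allOf", []):
        found = find_keyword(subschema, keyword,
                             resolve_ref, depth + 1, max_depth)
        if found is not None:
            return found
    return None
```

For example, a starting schema of `{"allOf": [{"$ref": "#/components/schemas/Color"}]}` would yield the `type` declared in the referenced `Color` schema.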
karenetheridge (Member):

Is this suggesting that if contentMediaType is present at the top level of the schema that the entire data instance should be decoded with that media type before passing to validation? That would conflict with the expectation by the schema itself that the data instance is a string and that contentSchema should be used for validation of the decoded instance -- if the instance is already decoded to an object, contentSchema will do nothing. I think we should recommend that an explicit media type should be specified instead, which is available for all places that a schema is used (parameters as well as message bodies).

Additionally, we could specify that schema keywords that require a specific type (e.g. uniqueItems, properties) can be used to infer a specific data type for the instance. I also wouldn't mind if we didn't do this, and instead required the user to be explicit with a type keyword, so an implementation doesn't have to enumerate all of the type-specific keywords present in JSON Schema (it's version specific, e.g. an OAD that happened to use the a schema dialect under draft2019-09 would need to include additionalItems -- and asking the OAD parser to also have to parse the $schema keyword is going a little far).

karenetheridge (Member):

Also, is this suggesting that data deeper within the instance (for example items in an array) need type inference as well, or is this intended only as a mechanism for determining the type of the overall data instance?

If the latter, there is no need to perform schema inspection for an XML document, as surely the media type would be indicating that. The only application I've found in my implementation for needing type inference is for the parameter style and explode features, which require an array or object to operate on and do not use media types.

handrews (Member Author):

@karenetheridge

Is this suggesting that if contentMediaType is present at the top level of the schema that the entire data instance should be decoded with that media type before passing to validation?

No. I'll have to go back over this in detail to try to figure out how it seems to be implying that and improve it. I'm trying to cover several scenarios here and I probably muddled them together a bit.

Additionally, we could specify that schema keywords that require a specific type

I spent too much effort ripping a rule somewhat like that out of 3.x (type: array required items in 3.0) to be willing to do such a thing now (and I appreciate that you would rather not do that either). More seriously, the goal here is to strike a balance of allowing some authoring flexibility in arranging schemas without requiring implementors to do too much. Inferring types from other keywords, which can quickly get contradictory when dealing with dynamically typed structures, is definitely over the line into "too much" for me.

and instead required the user to be explicit with a type keyword

I don't want to outright require one in any specific place either. I'd like to make it clear where things definitely MUST work, and make it clear that there are a lot of things beyond that that won't be expected to work. But there might be unexpected configurations that will work fine in practice, or that specific tools support (perhaps already implemented when deciding how to handle the ambiguity in the first place).

Also, is this suggesting that data deeper within the instance (for example items in an array) need type inference as well, or is this intended only as a mechanism for determining the type of the overall data instance?

It's for whatever requirements in the spec need this kind of thing. Correlating Encoding Objects with Schema Objects is probably the most obvious and complicated case. And then once you've done that, figuring out the default contentType. Other use cases include figuring out the destination type of ambiguous strings (e.g. query parameters).

If the latter, there is no need to perform schema inspection for an XML document, as surely the media type would be indicating that.

No, this is the "where data is stored as strings without type information" use case, where it might be unclear how to parse `<foo attr="true">42</foo>`. Are those the boolean true and the number 42, the strings "true" and "42", or some combination? That's distinct from the multipart part use case, or other embedded media type use cases where you might need to figure out a media type that does not have a Media Type Object as a parent key; see PR #4744 for this specific use case.
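
To illustrate that ambiguity, here is a minimal sketch of schema-driven parsing of such an XML leaf value; the function name and the limited set of keywords handled are invented for illustration, and real implementations would need to cover far more of the XML Object's behavior.

```python
import xml.etree.ElementTree as ET

def parse_xml_leaf(xml_text, schema):
    """Decide whether the text content of an element like
    <foo attr="true">42</foo> is the number 42, the boolean true, or the
    string "42", by consulting the schema's `type` keyword, since XML
    itself carries no type information.  Sketch only."""
    element = ET.fromstring(xml_text)
    text = element.text or ""
    declared = schema.get("type")
    if declared in ("number", "integer"):
        return float(text) if "." in text else int(text)
    if declared == "boolean":
        return text == "true"
    return text  # no usable type information: leave it as a string
```

With `{"type": "integer"}` the content parses as the number 42; with `{"type": "string"}` (or no `type` at all) it stays the string "42".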

handrews (Member Author):

@karenetheridge to make this part more clear:

But there might be unexpected configurations that will work fine in practice

For example, if you set the Encoding Object's contentType explicitly, there's no need to locate any schema information to determine its default. But also I don't want complex rules around "if you don't set contentType explicitly, your schema MUST look like...." Rather, if it is unclear whether tooling can detect the default behavior, authors should heed the limitations in the spec and decide to set contentType explicitly themselves.

Similarly, when parsing query parameters and trying to figure out the type, maybe tools just try parsing them in that type order I give, on the grounds that the precedence is probably right. Basically treating it as an "any" type and guessing. As long as they validate it afterwards, that should be pretty reliable (I say "should" because somewhere there's some schema that could work out either of two ways and was intended to go the other way, but I can't figure out what that would look like off the top of my head).
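
The guess-and-validate approach described here might look something like the following sketch, where `validate` stands in for a full JSON Schema validator, `coerce_untyped` is an invented name, and the candidate ordering (with string as the last resort) is only one possible choice.

```python
import json

def coerce_untyped(raw, validate):
    """Guess-and-validate: try candidate parses of an untyped string and
    keep the first one the schema accepts.

    `validate` is a hypothetical callback wrapping a JSON Schema
    validator; string is deliberately the final fallback."""
    candidates = []
    if raw == "":
        candidates.append(None)             # empty string may mean null
    try:
        candidates.append(json.loads(raw))  # number/boolean/object/array
    except ValueError:
        pass
    candidates.append(raw)                  # string is the last resort
    for value in candidates:
        if validate(value):
            return value
    raise ValueError(f"no parse of {raw!r} satisfied the schema")
```

As the comment above notes, validating after guessing is what makes this reasonably reliable: a candidate that parses but fails validation is simply discarded.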

karenetheridge (Member):

I don't want to outright require one in any specific place either. I'd like to make it clear where things definitely MUST work, and make it clear that there are a lot of things beyond that that won't be expected to work.

One scenario here where a type keyword is required is for a header parameter -- an incoming value of "blue,black,brown" could be legitimately parsed as a string or an array, and we'd need to look at the schema to know which is expected. If we see "type":"array" at the top level of the schema then we can confidently parse into the array ["blue", "black", "brown"] and at least have a chance of validating against the schema as a whole (whereas leaving it as a string will definitely be invalid).

In my implementation I don't do any of the guessing that you describe (try parsing into one type, see if it's valid, then parse it a different way and try again). The schema gets one crack at trying to validate the data, and if it failed to specify what type was expected, too bad!

It's nice that this is getting into the spec in clear language. When I implemented parameter parsing I was following https://swagger.io/docs/specification/v3_0/serialization/ and tried to reverse-engineer what they were assuming, since these docs are all written from the client-side perspective (constructing an HTTP message from raw data) as opposed to the parsing and deserializing perspective.
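
A minimal sketch of the top-level-`type` approach this comment describes; only the array-versus-string distinction is handled, the function name is invented, and everything else falls back to the raw string.

```python
def parse_header_value(raw, schema):
    """Deserialize a simple-style header value per the schema's declared
    top-level `type`.  Sketch only: an incoming "blue,black,brown"
    becomes an array when the schema says so, otherwise stays a string."""
    if schema.get("type") == "array":
        return [item.strip() for item in raw.split(",")]
    return raw
```

With `{"type": "array", "items": {"type": "string"}}` this yields `["blue", "black", "brown"]`; with `{"type": "string"}` the value is left as the single string, which may then fail or pass validation on its own merits.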

handrews (Member Author):

The schema gets one crack at trying to validate the data, and if it failed to specify what type was expected, too bad!

There are too many scenarios where it's reasonable not to have a type right there in the inline schema. If nothing else, when there is a $ref. I wrestled a bit with how much to require, but following $ref and allOf seems straightforward enough. But I am open to other requirements. I have just seen plenty of implementations (including one I worked on - not publicly available sadly) that search those and roll up examples or flatten allOfs to make the documentation rendering better. It seemed like a reasonable amount of searching.

I also considered allowing implementations to put some sort of step limit on it, like not following more than X number of $refs, but it's not like that gets any harder the more you do, and figuring out how to count "steps" with both $ref and allOf hurt my brain :-)


One scenario here where a type keyword is required is for a header parameter -- an incoming value of "blue,black,brown" could be legitimately parsed as a string or an array, and we'd need to look at the schema to know which is expected.

This is a really good point as "type": ["array", "string"] would be a problem for that parameter. I think the rules I wrote would end up with it treated as an array, as string is always the type of last resort. But do we want that? Unlike the use cases I was thinking of, this truly could go either way, so we're picking a winner here based on rules I made up as I wrote them.

The one multi-value type use case that I think we MUST (heh) require support for is adding "null" to another type (e.g. "type": ["number", "null"]). If we think the multi-type rules are too complex or too likely to have unexpected behavior, we could just special-case it for one non-"null" type plus "null". I'm really uncertain as to what is best here and would love to hear more opinions.

karenetheridge (Member):

This is a really good point as "type": ["array", "string"] would be a problem for that parameter. I think the rules I wrote would end up with it treated as an array, as string is always the type of last resort.

That's what I do -- array is selected first if it's possible, then object, then string is the final fallback as the "original" format of the data.

BTW after you created this PR I expanded my type checking to look for allOfs, which it does recursively. It was already following $refs for this (bombing out after a configurable max stack depth value is reached). I pondered adding type inference from seeing array- or object-specific keywords but decided to shelve that for now. But allOf is a quite reasonable addition.

handrews (Member Author) commented Jun 26, 2025:

@karenetheridge so do you think we should keep the type precedence list for multi-valued type? Sounds like you were doing something close to this?

Regarding the header parameter, I did some digging and there seem to be three cases:

  1. Headers that cannot be list-valued, in which case blue,black,brown need not be quoted and is just a string
  2. Headers that can be list-valued in the normal way, where the values can appear on a single line, separated by commas, in which case blue,black,brown is an array of strings but "blue,black,brown" is a string (because of the quotes... "blue","black","brown" would also be an array, with unnecessary but allowed quotes)
  3. Set-Cookie, which has to be treated differently from everything else per its own RFC and per RFC9110 (because it can be list-valued, but it also uses commas in unquoted values so you can't use commas as a delimiter on a single line, you have to put it on multiple lines)
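
For case 2, Python's csv module happens to implement comma-splitting with double-quote awareness, which can sketch the distinction between quoted and unquoted list members. This is an approximation of HTTP quoted-string handling rather than a full RFC 9110 field parser, and per case 3 above, Set-Cookie must never be run through it.

```python
import csv
import io

def split_header_list(raw):
    """Split a list-valued header field on commas, honoring double
    quotes, so that "blue,black,brown" stays one string while
    blue,black,brown splits into three members (case 2 above)."""
    return next(csv.reader(io.StringIO(raw), skipinitialspace=True))
```

So `blue,black,brown` yields three members, `"blue,black,brown"` yields one, and `"blue","black","brown"` yields three members with their (allowed but unnecessary) quotes removed.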

handrews (Member Author):

@karenetheridge while I have your attention, do you think this is fine where it is or should it go under the Schema Object somewhere? I really could not decide.

@handrews handrews marked this pull request as draft June 22, 2025 03:57
handrews (Member Author) commented Jun 22, 2025:

I'm putting this in draft because based on @karenetheridge's feedback I'm going to rework it fairly substantially, but it's still of use when understanding how it fits with the other related PRs.

The effect of the rewrite should be the same, but I think the wording and organization will be significantly different. It's clear that the different use cases here need to be separated out and clarified. I think this ended up being a bit oddly abstract because of how I tried to split things up into PRs that don't conflict.

Move things under the Schema Object, organize by use case and
by the point in the process at which things occur, and link
directly from more parts of the spec so that the parts in
the Schema Object section can stay more focused.
handrews (Member Author):

I have added a commit that almost totally rewrites this; you probably just want to review the whole thing rather than the per-commit diff, as it will be a mess. The new version:

  • Puts most things under the Schema Object
  • Organizes use cases by the point in the process they occur relative to schema evaluation
  • Links from elsewhere in the spec so that we do not need to include quite as much in the main part of the text

I do not think that has changed anything substantial, but it's essentially a new PR now.

@handrews handrews marked this pull request as ready for review June 22, 2025 22:08
handrews (Member Author):

@karenetheridge I'm going to mark various threads as resolved since the text is now so different that they are confusing. Please do not take that to mean I'm dismissing open questions; just re-start whatever is needed with comments on the new text, or as new top-level comments. Apologies for the inconvenience.

src/oas.md Outdated

###### Schema Evaluation and Binary Data

As noted under [Working with Binary Data](#working-with-binary-data), Schema Objects for binary documents do not use any standard JSON Schema assertions, as the only ones that could apply (`const` and `enum`) would require embedding raw binary into JSON which is not possible.
karenetheridge (Member):

Well, not quite: there are also minLength and maxLength. But I think I'd be inclined to check the Content-Length header for this instead.

handrews (Member Author):

@karenetheridge oh good point, we even say something about maxLength in binary streaming... I'll rework this a bit.

handrews (Member Author):

@karenetheridge fixed in the most recent commit.


However, `multipart` media types can mix binary and text-based data, leaving implementations with two options for schema evaluations:

1. Use a placeholder value, on the assumption that no assertions will apply to the binary data and no conditional schema keywords will cause the schema to treat the placeholder value differently (e.g. a part that could be either plain text or binary might behave unexpectedly if a string is used as a binary placeholder, as it would likely be treated as plain text and subject to different subschemas and keywords).
karenetheridge (Member):

"format": "binary" would be an ideal signal for this.

handrews (Member Author):

That's not a format in 3.1+. The appropriate signal would be "contentMediaType": "application/octet-stream" or similar. But because of how the Encoding Object works, it's all more complicated.

karenetheridge (Member):

There are many other binary media types, e.g. image/jpeg, where it's not useful to apply a schema to the content and we might want to use a placeholder.

I'm now wondering if we need a signal for "do not bother trying to deserialize this content for the purpose of validation". At present my implementation always applies the appropriate media type deserialization to the content and applies the schema to it (and throws an error if it doesn't know how to process that media type, or if the deserializer encountered an error), but bypassing that might be useful.

handrews (Member Author):

@karenetheridge I meant "some binary media type in contentMediaType" more than specifically application/octet-stream, but arguably tacking "or similar" on as I did was not sufficiently clear!

I'm open to better ideas here. Perhaps it is best to remove the placeholder idea as it is pretty squishy and can easily cause problems, and just document the schema search (and break apart) process, which leverages things implementations have to do anyway. We could include a brief note about the possibilities and dangers of using a placeholder instead? idk.

I'm now wondering if we need a signal for "do not bother trying to deserialize this content for the purpose of validation". At present my implementation always applies the appropriate media type deserialization to the content and applies the schema to it (and throws an error if it doesn't know how to process that media type, or if the deserializer encountered an error), but bypassing that might be useful.

I'm not entirely sure what to do with this. I sort-of follow you, but I would not be surprised if you have the only JSON Schema implementation that can handle this at all, and I want to write the requirements for the typical case.

src/oas.md Outdated

If this is not possible, the schemas MUST be searched to see if the information can be determined without performing evaluation.
As schema organization can become very complex, implementations are not expected to handle every possible schema layout.
However, given a known starting point schema (usually the value of the nearest `schema` field), implementations MUST search the following for the relevant keywords (e.g. `type`, `format`, `contentMediaType`, `properties`, `prefixItems`, `items`, etc.):
karenetheridge (Member):

If we say MUST, we can't give an incomplete example list of keywords - we need to spell out exactly which ones are supported.

Instead we could say "..implementations SHOULD search the following for relevant keywords such as ..., and MUST document which keywords are supported."

handrews (Member Author):

The MUST for that is at the point of each use. So in some cases you MUST look for type, but in others you MUST look for properties, etc. It's not always the same list.

karenetheridge (Member):

Ok I see what you mean.

handrews (Member Author) commented Jun 28, 2025:

@karenetheridge I added some language to clarify this in the most recent commit [EDIT: 2nd-most-recent commit now].

src/oas.md Outdated

When a `type` keyword with multiple values (e.g. `type: ["number", "null"]`) is found, implementations MUST attempt to use the types as follows, ignoring any types not present in the `type` list:

1. Determine if the data can be parsed as whichever of `null`, `number`, `object`, or `array` are present in the `type` list, treating `integer` as `number` for this step.
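
One possible reading of this precedence rule as code; the function name and error handling are illustrative only, and (as the later discussion notes) this level of multi-type handling was subsequently trimmed back in the PR.

```python
import json

def parse_with_types(raw, types):
    """Attempt the precedence order from the quoted text: null, then
    number (with integer folded in), then object/array, with string as
    the fallback.  A sketch of one reading, not normative behavior."""
    types = {"number" if t == "integer" else t for t in types}
    for t in ("null", "number", "object", "array", "string"):
        if t not in types:
            continue
        if t == "null":
            if raw == "":
                return None        # empty string treated as null
            continue
        if t == "string":
            return raw             # string is the type of last resort
        try:
            value = json.loads(raw)
        except ValueError:
            continue
        if t == "number" and isinstance(value, (int, float)) \
                and not isinstance(value, bool):
            return value
        if t == "object" and isinstance(value, dict):
            return value
        if t == "array" and isinstance(value, list):
            return value
    raise ValueError(f"cannot parse {raw!r} as any of {sorted(types)}")
```

Under this sketch, `type: ["number", "null"]` parses "42" as the number 42 and the empty string as null, while anything unparseable falls through to string only if "string" is listed.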
karenetheridge (Member):

We can also parse booleans from the strings "0", "1", "false", "true". I have often seen query parameters using 0 or 1 as a flag value.

handrews (Member Author):

That's basically why string is the last. I'm not sure what to do here and honestly it's a strong argument for saying that the only thing people need to support is type: ["X", "null"] and let everything else be implementation-defined. Because if you have an instance of 0 or the empty string and type: "boolean" it's easy enough to parse it as false, but if you have type: ["boolean", "number"] then how do you know whether 0 is the boolean false or the number 0? I am increasingly inclined to not try to solve this level of type determination. 🤔

handrews and others added 2 commits June 27, 2025 15:06
Co-authored-by: Karen Etheridge <[email protected]>
Also clarify that there is no one set list of keywords to search
for, but rather each use case defines what is relevant.
handrews (Member Author):

@karenetheridge I trimmed back the multi-valued type requirements as from our discussion I just see too many ways it can go wrong. Now it's just "if you have [X, "null"] treat it like X" and everything else is optional guidance. How does that sit with you?

Labels: media and encoding, schema-object

2 participants